[Reduce_then_scan refactor pt 2] Relaxing requirement subgroup size #2657

Open
danhoeflinger wants to merge 21 commits into dev/dhoeflin/enable_reduce_then_scan_everywhere from dev/dhoeflin/remove_subgroup_size_requirement

Conversation

Contributor

@danhoeflinger danhoeflinger commented Apr 7, 2026

Relaxes the requirement of subgroup size 32 / 16 for reduce_then_scan (without sacrificing performance).

  • Remove the compile-time __sub_group_size template parameter from the scan building blocks and replace it with a runtime query via sub_group::get_max_local_range(), so that arbitrary sub-group sizes are supported. For compilers that support sycl::reqd_sub_group_size, the queried value can in practice be treated as a constexpr, so the optimizations remain available.
  • Remove the helpers that hardcode the sub-group size and that determine whether the required sub-group size is available. Remove the gating around reduce_then_scan (except in output-limited cases).
  • Replace [[sycl::reqd_sub_group_size(...)]] with [[_ONEDPL_SYCL_REQD_SUB_GROUP_SIZE_IF_SUPPORTED(32)]] to allow the kernel to run on devices that don't support sub-group size 32
  • Limit the work-group size on CPU to 256, rather than 8k. Together with using SLM implementations rather than sub-group communication, this makes a large performance difference for reduce_then_scan on CPU targets.
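
To make the first and last bullets concrete, here is a minimal host-side sketch in plain C++. It is not oneDPL code and does not use SYCL: the function names `sub_group_inclusive_scan` and `capped_work_group_size` are hypothetical, the `lanes` vector stands in for the values held by the work-items of one sub-group, and the behavior for non-CPU devices (no cap) is an assumption. The point is only that the scan takes its lane count as a runtime value instead of a template parameter, and that the CPU cap of 256 is a simple clamp.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side stand-in for a sub-group inclusive scan (hypothetical helper).
// The lane count comes from lanes.size(), a runtime value, rather than from
// a compile-time template parameter, so any sub-group size works.
template <typename T, typename BinaryOp>
void sub_group_inclusive_scan(std::vector<T>& lanes, BinaryOp op)
{
    const std::size_t sub_group_size = lanes.size();
    // Hillis-Steele scan: at each step, lane i combines with lane i - shift.
    for (std::size_t shift = 1; shift < sub_group_size; shift *= 2)
    {
        // Copy models the lane values as they were before this "shuffle" step.
        const std::vector<T> prev = lanes;
        for (std::size_t lane = shift; lane < sub_group_size; ++lane)
            lanes[lane] = op(prev[lane - shift], prev[lane]);
    }
}

// Hypothetical helper: clamp the work-group size to 256 on CPU devices.
// Leaving other devices at their maximum is an assumption of this sketch.
inline std::size_t capped_work_group_size(std::size_t device_max, bool is_cpu)
{
    constexpr std::size_t cpu_cap = 256; // value taken from the PR description
    return is_cpu ? std::min(device_max, cpu_cap) : device_max;
}
```

For example, scanning five lanes holding 1, 2, 3, 4, 5 with addition leaves 1, 3, 6, 10, 15 in place, and a CPU device reporting a maximum of 8192 would be clamped to 256.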

Full picture:

Contributor

Copilot AI left a comment

Pull request overview

This PR continues the reduce_then_scan refactor by removing hard-coded compile-time sub-group sizes (e.g., 32/16) and expanding applicability of the reduce-then-scan pattern across more devices (including CPU), while attempting to preserve performance via runtime sub-group queries and adjusted work-group sizing.

Changes:

  • Removes the device capability gating around reduce_then_scan and switches several algorithms to always use it (with remaining gating only for limited-output cases).
  • Refactors sub-group scan building blocks to query sub-group sizing at runtime and updates downstream KT utilities to match the new API.
  • Adjusts CPU work-group sizing caps and communication strategy (favoring SLM-based comms on CPU / non-trivially-copyable types).
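
For readers unfamiliar with the reduce-then-scan pattern referred to above, the following is a sequential, host-only sketch of the general two-phase idea for an inclusive sum. It is an illustration under stated assumptions, not the oneDPL implementation: the function name `reduce_then_scan_sum` and the `block_size` parameter (standing in for the amount of input handled by one work-group) are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential sketch of a two-phase reduce-then-scan inclusive sum.
// block_size is a hypothetical stand-in for the input handled per work-group.
inline std::vector<int> reduce_then_scan_sum(const std::vector<int>& in,
                                             std::size_t block_size)
{
    const std::size_t n = in.size();
    std::vector<int> out(n);
    if (n == 0)
        return out;
    block_size = std::max<std::size_t>(block_size, 1);
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    // Phase 1 (reduce): each block computes only its total.
    std::vector<int> block_sums(num_blocks, 0);
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        const std::size_t end = std::min(n, (b + 1) * block_size);
        for (std::size_t i = b * block_size; i < end; ++i)
            block_sums[b] += in[i];
    }

    // Exclusive scan of the block totals gives each block its carry-in.
    std::vector<int> carry(num_blocks, 0);
    for (std::size_t b = 1; b < num_blocks; ++b)
        carry[b] = carry[b - 1] + block_sums[b - 1];

    // Phase 2 (scan): each block scans its own elements, seeded by its carry-in.
    for (std::size_t b = 0; b < num_blocks; ++b)
    {
        int running = carry[b];
        const std::size_t end = std::min(n, (b + 1) * block_size);
        for (std::size_t i = b * block_size; i < end; ++i)
        {
            running += in[i];
            out[i] = running;
        }
    }
    return out;
}
```

In a device implementation the two block loops would run in parallel across work-groups; here they are serialized so the data flow between the reduce phase and the scan phase is easy to follow.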

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h | Removes gating/fallbacks so more scan/copy/set operations use reduce-then-scan by default. |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h | Removes compile-time sub-group size template params; adds runtime device sub-group size queries and new work-group caps. |
| include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h | Updates KT sub-group scan wrapper calls to the new reduce-then-scan scan primitive signatures. |
| include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h | Updates cooperative lookback’s use of sub-group scan primitives to match new templates. |

Comment thread include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h Outdated
Comment thread include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h
@danhoeflinger danhoeflinger added this to the 2022.13.0 milestone Apr 13, 2026
@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/remove_subgroup_size_requirement branch from 191a9e3 to 1c2fcc1 April 13, 2026 19:32
@danhoeflinger danhoeflinger marked this pull request as ready for review May 6, 2026 13:23
@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/enable_reduce_then_scan_everywhere branch from 5cd54c4 to c32888d May 7, 2026 02:47
@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/remove_subgroup_size_requirement branch from 7dc127d to 4a1560b May 7, 2026 02:47
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
@danhoeflinger danhoeflinger force-pushed the dev/dhoeflin/remove_subgroup_size_requirement branch from da8f86a to 5b60051 May 12, 2026 19:40
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
2 participants